A significant theoretical advantage of search-and-score methods for learning
Bayesian Networks is that they can accept informative prior beliefs for each
possible network, thus complementing the data. Currently, however, there are limited
practical ways of assigning priors to each possible network. In this paper, we
present a method for assigning priors based on beliefs on the presence or absence
of certain paths in the true network. Such beliefs correspond to knowledge about
the possible causal and associative relations between a pair of variables X and
Y. This type of knowledge naturally arises from prior experimental and
observational datasets, among others. We show that incorporating such prior knowledge
may improve not only the learning of the direction of the causal relations in the
network, but also the learning of the network skeleton. This is particularly the
case when sample size is low and thus prior knowledge increases in importance.
Our approach is based on converting possibly-incoherent beliefs about marginals
to joint distributions of priors by use of optimization theory.

1 Introduction 

One theoretical advantage of the search-and-score approach to learning Bayesian Networks [?] versus
the constraint-based approach is that the former naturally accepts priors for each network.
Since the number of possible networks is super-exponential in the number of variables, in a practical
setting one has to assign priors implicitly, avoiding enumeration of all structures. For example,
one could devise an easily computable function that returns the prior of a given network. In addition,
prior network probabilities have to be assigned so that they reflect our prior knowledge of the domain.

In this paper, we present a method that accepts users' beliefs (probabilities) regarding the possible 
paths between a set of pairs of variables (X,Y). Paths between variables directly correspond to 
causal or associative relations, e.g., X causes Y, X and Y do not cause each other but have a 
common ancestor, or X and Y are statistically associated. For each possible network, the method 
can efficiently compute its prior corresponding to these input beliefs. It can thus be employed by a 
search algorithm trying to maximize the score of a network. 

Causal knowledge is naturally derived from prior experimental data, while associative knowledge
is derived from observational data. For example, consider a dataset D measuring the average amount
of exercise per week E, calcium in diet C, occurrence of osteoporosis by age 60 O, and smoking S in
a cohort of women. A Bayesian Network could be induced by any appropriate learning method.
However, if a prior experimental study showed that increasing the amount of exercise reduces the
occurrence of the disease, then the knowledge that [E causes (i.e., causally affects) O] with
probability p should be incorporated during learning. Similarly, if a prior cohort study
(observational study) has shown that smoking correlates with reduced exercising, then the knowledge
that [S and E are associated] with probability p' should also be included. The belief strengths p
and p' depend on several factors, such as the statistical power of the study, the p-values, and the
quality of the prior studies. Notice that the fact [E causes O] does not correspond to the presence
of the edge E → O in the network: the edge implies a direct causal relation, which depends on the
context of modeled variables, while [E causes O] does not.

In simulated proof-of-concept experiments we show that the new scoring method can indeed take 
advantage of prior knowledge. When provided with causal knowledge, it is able to better learn the 
orientations of the edges and the causal relations. For example, let us assume that one learns from the 
data the Markov-Equivalence class of the Bayesian Networks (called the Partially Directed Acyclic 
Graph (PDAG) or the essential graph) G with the maximum likelihood to be X — Y — Z. When
given prior knowledge that [X causes Z] with high probability, the network X → Y → Z obtains a
higher a posteriori probability than all other networks in the PDAG. In addition, informative priors
can also facilitate learning the skeleton of the network; intuitively, prior belief that X and Y are 
associated tends to induce the true edges that connect the two variables. 

One important technical difficulty in the proposed method is that of computing the joint distribution
of the input path beliefs, e.g., computing P(X causes Y, Y causes Z) given P(X causes Y) = 0.8
and P(Y causes Z) = 0.8. On one hand, there may be several choices for the joint given the same
marginal beliefs. For example, in the above scenario the marginals alone only imply that
P(X causes Y, Y causes Z) ∈ [0.6, 0.8]. Thus, path beliefs are inherently dependent. On the other
hand, the beliefs may be incoherent [?], i.e., not extendable to a joint distribution that satisfies
the probability axioms. We present a method that computes a joint distribution of the path properties
such that: when the path beliefs are coherent, the joint is the closest to uninformative priors; when
the input beliefs are incoherent, the joint is chosen to be coherent and to induce path probabilities
that are the closest to the input beliefs. Once the joint is computed, it can be employed to compute
the prior of a network, e.g., the prior of X → Y → Z is proportional to P(X causes Y, Y causes Z).
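To make the dependence between path beliefs concrete, the range of joint values compatible with two marginal beliefs can be computed from the Fréchet inequalities. The sketch below is only an illustration of that arithmetic; `frechet_bounds` is a hypothetical helper name and is not part of the proposed method.

```python
def frechet_bounds(p_a: float, p_b: float) -> tuple:
    """Tightest bounds on P(A and B) when only P(A) and P(B) are known."""
    lower = max(0.0, p_a + p_b - 1.0)  # the two events must overlap at least this much
    upper = min(p_a, p_b)              # the joint cannot exceed either marginal
    return lower, upper


# Marginal beliefs from the running example: P(X causes Y) = P(Y causes Z) = 0.8.
low, high = frechet_bounds(0.8, 0.8)
```

Any joint value inside the returned interval is compatible with the two marginals, which is why an additional criterion is needed to select a single joint.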

There are currently several other methods that make use of prior knowledge when learning a network,
e.g., using knowledge regarding the parameters of the network [?], a causal total order of
the variables [?] (i.e., totally ruling out all networks that do not admit the given total order), or
the presence or absence of directed edges in the network [5], possibly with beliefs assigned to them
[6][7]. Directed edges correspond to direct causal relations, i.e., relations not mediated by any other
variable in the model. Being "direct" depends on the context, i.e., the modeled variables. Such
knowledge does not naturally arise from other sources such as past datasets or even expert opinion.
Other work represents prior knowledge in the form of a prior Bayesian Network: prior probabilities
are assigned based on the distance from this network [8]. Again, it is highly unlikely that such
complete prior knowledge is available in a domain to construct this prior network. In general, it can
be argued that the type of knowledge the existing methods can incorporate during learning is not in
a form that can be easily acquired. As a result, uniform (and thus uninformative) priors are commonly
used when learning Bayesian Networks from data. The problem of incorporating informative
priors while learning is included in the list of open problems in a recent causality editorial [9].

Prior work that specifically considers the problem of path constraints or beliefs is [10][11]. The
method in [10] assumes one first learns a Markov-Equivalence class of Maximal Ancestral Graphs
(a generalization of Bayesian Networks that admits hidden variables) [2] from data and then prior
knowledge in the form of path constraints is imposed on the graph. In contrast, in this work the
network is learnt with the help of the prior knowledge. In addition, in that work the path priors
consist of hard constraints that do not admit degrees of belief. In [11] a method is presented for
incorporating beliefs on paths, but it relies on computationally expensive Markov Chain Monte Carlo
(MCMC) simulations. However, neither the latter nor any other method dealing with prior knowledge
[6][7] deals with the issues of dependent, and possibly incoherent, beliefs.

2 Background 

We assume the reader's familiarity with Bayesian Networks [12][13] and the corresponding learning
algorithms, and only briefly review the basic concepts. Let V be a set of n random variables
$V_1, \ldots, V_n$. In the rest of the paper we assume discrete variables, but the method applies to
any type of variables. A Bayesian Network (BN) over V is a pair $B = (G_V, P_V)$, where $G_V$ is a
Directed Acyclic Graph (DAG) representing conditional independencies between the variables V, and
$P_V$ is the joint distribution of V. The graph and distribution must be connected by the equation
$P_V = \prod_i P(V_i \mid Pa_G(V_i))$, where $Pa_G(V_i)$ are the parents of $V_i$ in G. The above
equation is equivalent to what is called the Markov Condition. When the network is fixed in a
context we drop the indexes V, G from the equations.

The skeleton of a Bayesian Network G is the undirected graph which can be constructed by ignoring
the orientations of G. A triple of vertices (X, Y, Z) is called a collider in G if X → Y ← Z is
in G. A collider (X, Y, Z) is unshielded if X and Z are not adjacent in G. Two BNs are called
Markov equivalent if: (a) they have the same skeleton, and (b) they have the same set of unshielded
colliders. A Partially Directed Acyclic Graph (PDAG) (also known as essential graph) is a graph
representing a set of Markov equivalent BNs. It has the same skeleton as all BN representatives, and
an edge is directed if and only if it is invariant in all BN representatives, and is undirected otherwise.
We call a directed path from X to Y (denoted as $X \Rightarrow Y$) in a graph a sequence of unique
edges and nodes in the graph $X \to V_1 \to \ldots \to V_j \to Y$. We denote as $X \Leftrightarrow Y$
the case where there is a distinct node $Z \in V$ that is a common ancestor of X and Y (i.e.,
$X \Leftarrow Z \Rightarrow Y$) but neither X is an ancestor of Y nor the reverse. A d-connecting
path (given the empty set) between X and Y exists if either $X \Rightarrow Y$, $X \Leftarrow Y$, or
$X \Leftrightarrow Y$. The absence of a d-connecting path between X and Y is denoted as
$X \nLeftrightarrow Y$. In the rest of the paper, we assume the Faithfulness Condition [2], which
(together with the Markov Condition) implies that there is a d-connecting path between X and Y
if and only if the two nodes are statistically associated (dependent). This assumption is important
only when considering associative priors.
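The four mutually exclusive path relations defined above can be read off a DAG from its ancestor sets. The following is a minimal sketch under our own encoding assumptions: the graph is given as a hypothetical parent map, and the string labels `"=>"`, `"<="`, `"<=>"` and `"none"` stand for directed path, reverse directed path, common ancestor only, and no d-connecting path, respectively.

```python
from typing import Dict, Set


def ancestors(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Return all proper ancestors of `node` in a DAG given as a parent map."""
    seen: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def path_relation(parents: Dict[str, Set[str]], x: str, y: str) -> str:
    """Classify the pair (x, y) into one of the four path relations."""
    anc_x, anc_y = ancestors(parents, x), ancestors(parents, y)
    if x in anc_y:
        return "=>"    # directed path from x to y
    if y in anc_x:
        return "<="    # directed path from y to x
    if anc_x & anc_y:
        return "<=>"   # common ancestor, neither is an ancestor of the other
    return "none"      # no d-connecting path given the empty set


chain = {"Y": {"X"}, "Z": {"Y"}}      # X -> Y -> Z
collider = {"Y": {"X", "Z"}}          # X -> Y <- Z
fork = {"X": {"W"}, "Y": {"W"}}       # X <- W -> Y
```

For instance, `path_relation(chain, "X", "Z")` returns `"=>"` even though the chain has no edge X → Z, which is exactly the distinction between a causal relation and a direct edge.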

Assume we are given a complete multinomial dataset D over variables V. The probability of a
network (or model) G over V is $P(G \mid D) = P(D \mid G) \cdot P(G) / P(D) \propto P(D \mid G) \cdot P(G)$.
Taking the logarithm of each side we obtain $\log P(G \mid D) = \log P(D \mid G) + \log P(G) + c$,
where the constant $c = -\log P(D)$ does not depend on G. The first term is the log-likelihood of
the data given the graph, while the second is the log of the prior of the graph. The graph that
maximizes $\log P(G \mid D)$ also maximizes the sum of these two terms. Bayesian scoring methods such
as BDe and BDeu [8] and K2 [?] try to approximate the log-likelihood based on different assumptions.
Thus, in general all such scoring methods can be decomposed as:

$$Sc(G \mid D) = Sc(D \mid G) + Sc(G) \qquad (1)$$

When priors are uniform the term Sc(G) can be ignored during maximization. In our setting how- 
ever, this term may become important. 

3 Representing Prior Path Beliefs 

For any pair $X, Y \in V$ we may have a prior belief on the possible paths connecting the two variables
in the network. It is important that we devise cases for such paths that are mutually exclusive and
allow the representation of common types of causal and associative knowledge. This is possible
as follows: we define the variables $r_{i,j}$ taking values in the set
$\{\Rightarrow, \Leftarrow, \Leftrightarrow, \nLeftrightarrow\}$, with the semantics
$V_i \Rightarrow V_j$, $V_i \Leftarrow V_j$, $V_i \Leftrightarrow V_j$, and $V_i \nLeftrightarrow V_j$
respectively. When the specific variables $V_i$, $V_j$ we refer to are not important we will use a
single index: $r_k$. The input K (knowledge) to our method is a set of prior distributions for some
variables $r_{i,j}$. An example is shown in Table 1a (top), expressing the belief that most likely
there is a directed path from X to Y and from Y to Z.

The possible paths between variables dictate their possible causal and associative relations. When
the Bayesian Network is interpreted causally, then $X \Rightarrow Y$ is equivalent to [X causes Y]. In
addition, as discussed in the previous section, $X \Rightarrow Y$ or $X \Leftarrow Y$ or
$X \Leftrightarrow Y$ is equivalent to [X is associated with Y]. Thus, a distribution
$P(r_{XY}) = \langle \pi_{\Rightarrow}, \pi_{\Leftarrow}, \pi_{\Leftrightarrow}, \pi_{\nLeftrightarrow} \rangle$
corresponds to the following beliefs about the causal and associative relations:

$$P(X \text{ causes } Y) = \pi_{\Rightarrow} \qquad P(X \text{ does not cause } Y) = \pi_{\Leftarrow} + \pi_{\Leftrightarrow} + \pi_{\nLeftrightarrow}$$

$$P(X \text{ associated with } Y) = \pi_{\Rightarrow} + \pi_{\Leftarrow} + \pi_{\Leftrightarrow} \qquad P(X \text{ not associated with } Y) = \pi_{\nLeftrightarrow}$$

In practice, it is useful to allow the user to specify prior beliefs directly on the events [X (not) causes
Y] and [X is (not) associated with Y], from which the distribution $P(r_{XY})$ can be derived, rather
than the opposite. This is not difficult: for example, given P(X causes Y) = $\pi_{\Rightarrow}$, the
probability mass $1 - \pi_{\Rightarrow}$ has to be distributed in a reasonable way to the other three
values $\pi_{\Leftarrow}$, $\pi_{\Leftrightarrow}$, $\pi_{\nLeftrightarrow}$. However,
we avoid this belief representation to simplify the presentation of the method.
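The correspondence between a path distribution and the four causal and associative beliefs is simple arithmetic, sketched below. The dictionary keys and the numeric values are our own illustrative assumptions; they are not taken from Table 1.

```python
def relation_beliefs(pi: dict) -> dict:
    """Map a distribution over the four path relations to causal/associative beliefs.

    `pi` is keyed by "=>", "<=", "<=>" and "none" (no d-connecting path).
    """
    return {
        "causes": pi["=>"],
        "not_causes": pi["<="] + pi["<=>"] + pi["none"],
        "associated": pi["=>"] + pi["<="] + pi["<=>"],
        "not_associated": pi["none"],
    }


# Hypothetical belief that X most likely has a directed path to Y.
example = relation_beliefs({"=>": 0.8, "<=": 0.05, "<=>": 0.1, "none": 0.05})
```

Note that the two causal beliefs sum to one, as do the two associative beliefs, so each pair describes a single binary event.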

Table 1: (a) (Top Part) Prior beliefs K regarding the paths between three pairs of variables. The
beliefs are incoherent: $P(X \Rightarrow Y) = 0.8$ and $P(Y \Rightarrow Z) = 0.9$ imply that
$P(X \Rightarrow Z) \in [0.7, 1]$.
(a) (Bottom Part) Induced coherent beliefs K' stemming from K by solving the quadratic program
in Eq. 10. (b) A part of the joint probability distribution J computed by solving Eq. 10 with input
K. The number of DAGs with 5 nodes for each configuration, $N_C$, is also shown. The total number
of DAGs with 5 vertices is N = 29281. The total number of configurations is $4^3 = 64$. Notice that
$C_2$ and $C_3$ have both zero counts and zero probability, because they are invalid.

4 Computing Priors and Scores

In this section, we derive a score $Sc(G \mid D, K)$ for a network graph G given data D and n prior
distributions on path beliefs in K. An important requirement for the computation of the score is
knowledge of a joint distribution $J = P(r_1, \ldots, r_n) = P(r)$ such that its marginals correspond to
the distributions in K. J assigns a probability value to each of the $4^n$ possible joint instantiations of
values to variables $r = (r_1, \ldots, r_n)$. We denote with C (configuration) such a given joint
instantiation and define

$$p_C = P(r = C \mid J)$$

In this section, we assume J is already computed; the next section describes the details of this
computation. The joint J stemming from K in Table 1a (top) is shown in Table 1b. It is important to
notice that for each graph G the configuration C is uniquely determined. For example, in the joint
of Table 1b, if in a graph G it holds that $X \Rightarrow Y$, $Y \Rightarrow Z$, $X \Rightarrow Z$, then
$r = C_1$. Thus, it makes sense to denote with $C_G$ the joint instantiation of variables r in graph G.

Let G be a Bayesian Network graph and D a dataset over the same variables. We now compute the
probability $P(G \mid D, J)$:

$$P(G \mid D, J) = \frac{P(D \mid G, J) \cdot P(G \mid J)}{P(D \mid J)} = \frac{P(D \mid G) \cdot P(G \mid J)}{P(D \mid J)}$$

The second equality stems from the fact that given the graph G the data D are independent of J
(J does not provide any additional information about the data once the graph is known). The factor
$P(D \mid J)$ is a normalizing constant that does not need to be computed when we maximize the above
equation over different graphs. The factor $P(D \mid G)$ is the likelihood of the data given the graph; in
Section 2 we mention several approximations (e.g., BDeu), each based on a different set of assumptions.
We now focus on the prior $P(G \mid J)$:

$$P(G \mid J) = \sum_{C} P(G, C \mid J) = P(G, C_G \mid J)$$

The last equality holds because $P(G, C \mid J)$ equals zero for all $C \neq C_G$, since each graph entails
exactly one configuration. Subsequently:

$$P(G \mid J) = P(G, C_G \mid J) = P(G \mid J, C_G) \cdot P(C_G \mid J) = P(G \mid C_G) \cdot P(C_G \mid J) = P(G \mid C_G) \cdot p_{C_G}$$

The factor $P(G \mid C_G)$ is our prior on a graph G given that a specific configuration holds. Given no
other preference or knowledge, we assign the same (uniform) prior to all graphs with the same
configuration. Thus, letting $N_C$ be the number of graphs over nodes V sharing the same configuration
C, we have $P(G \mid C_G) = 1/N_{C_G}$ and so:

$$P(G \mid J) = \frac{p_{C_G}}{N_{C_G}} \quad \text{and} \quad Sc(G \mid J) = \log p_{C_G} - \log N_{C_G} \qquad (2)$$

Similarly to Eq. 1, the overall score of a graph is:

$$Sc(G \mid D, J) = Sc(D \mid G) + Sc(G \mid J) \qquad (3)$$

The score $Sc(G \mid D, J)$ has two desirable properties:

1. Markov-equivalent graphs that satisfy the same path beliefs obtain the same score.
The last term in the equation above is the same for graphs sharing the same configuration.
The first term is the same for Markov-equivalent graphs provided one employs an
appropriate scoring function, such as the BDe and BDeu scores [8].

2. For uninformative prior beliefs, all graphs are equiprobable, i.e., $P(G \mid J) = 1/N$,
where N is the number of graphs over nodes V. With uninformative beliefs we expect
to encounter a given configuration with probability equal to the proportion of the graphs
satisfying the configuration, i.e., $p_C = N_C / N$. In that case,
$P(G \mid J) = \frac{N_{C_G}}{N} \cdot \frac{1}{N_{C_G}} = \frac{1}{N}$ and
we end up with uniform priors, as we would expect.

While Eq. 2 satisfies the above two properties, we point out that the factor $1/N_{C_G}$
may seem to provide counter-intuitive results at first glance. Let us assume that for configurations
$C_1$, $C_2$ the following holds: $p_1 = 0.6$ and $p_2 = 0.2$. In other words, the prior beliefs state
that it is 3 times more probable a priori that the true graph has configuration $C_1$ than $C_2$. Now,
let us assume that $N_1 = 60$ and $N_2 = 10$, and let $G_1$, $G_2$ be two graphs consistent with
configurations $C_1$, $C_2$ respectively. We then obtain:

$$\frac{P(G_1 \mid J)}{P(G_2 \mid J)} = \frac{p_1 \cdot N_2}{p_2 \cdot N_1} = \frac{0.6 \times 10}{0.2 \times 60} = \frac{1}{2}$$

Thus, any graph consistent with $C_2$ has twice the prior of any graph in $C_1$. This may seem
counter-intuitive since the user has specified that $C_1$ is 3 times more likely to be encountered than
$C_2$. This is true considering the total probability mass of $C_1$ and $C_2$. However, since this mass
is distributed over more graphs consistent with $C_1$ than with $C_2$, each individual graph in the
first configuration is less probable than any graph in the second configuration.

The implication of the above observation is that, everything else being equal, higher priors will
tend to be assigned to graphs in "small" configurations, i.e., configurations consistent with only a
few graphs. If this behavior is not desirable then one can drop the $1/N_C$ factor and use:

$$P(G \mid J) = p_{C_G} \quad \text{and} \quad Sc(G \mid J) = \log p_{C_G} \qquad (4)$$

However, if this score is used in place of Eq. 2 then Property 2 above is no longer satisfied.
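The two prior scores can be compared numerically with a few lines of code. This is a minimal sketch; the function name `log_prior` and its `per_graph` flag are hypothetical, and natural logarithms are assumed because they reproduce the values reported in Figure 1.

```python
import math


def log_prior(p_c: float, n_c: int, per_graph: bool = True) -> float:
    """Log prior of a graph whose configuration has probability p_c and n_c graphs.

    per_graph=True divides the mass evenly over the n_c graphs (Eq. 2);
    per_graph=False drops the 1/N_C factor (Eq. 4).
    """
    score = math.log(p_c)
    if per_graph:
        score -= math.log(n_c)
    return score


# Worked example from the text: p1 = 0.6, N1 = 60 versus p2 = 0.2, N2 = 10.
ratio_eq2 = math.exp(log_prior(0.6, 60) - log_prior(0.2, 10))
ratio_eq4 = math.exp(log_prior(0.6, 60, per_graph=False)
                     - log_prior(0.2, 10, per_graph=False))

# Figure 1a: p = 0.5068 with 2800 graphs in the configuration.
score_fig1a = log_prior(0.5068, 2800)
```

Under Eq. 2 the ratio is 1/2, while under Eq. 4 it is 3, which mirrors the trade-off between the two scores described above.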

Computing the number of graphs $N_C$. The number N of DAGs over nodes V can be computed
in closed form [14]. However, there is no closed form to the best of our knowledge for the number
$N_C$ of DAGs that satisfy certain path constraints. When the number of nodes is small (up to 5-6)
one can enumerate all DAGs and compute $N_C$ for each configuration C by counting. The
number of possible DAGs, however, grows super-exponentially in the number of nodes, and complete
enumeration quickly becomes infeasible. In this case, we estimate these counts by sampling a number S of
random DAGs with uniform probability. Specifically, we implemented the recent method in [15]
that, unlike prior work [16], avoids the use of expensive Markov Chain Monte Carlo methods to
ensure uniform sampling from the space of DAGs. $N_C$ can be estimated as $\frac{S_C}{S} N$, where
$S_C$ is the number of sampled DAGs that conform to configuration C. When the number of
configurations is large or $N_C / N$ is small, one may never sample any graph consistent with C. To
avoid zero estimates, we apply the Laplace correction $\hat{N}_C = \frac{S_C + l}{S + c \cdot l} N$,
where c is the number of configurations and l an arbitrary parameter (we use the value l = 1).
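For a handful of nodes, exhaustive enumeration is straightforward to sketch. The code below enumerates all labeled DAGs by choosing, for every unordered pair of nodes, no edge or one of the two orientations, and keeps the acyclic results. The function names are hypothetical, and `laplace_count` assumes the standard add-l form of the Laplace correction, which is our reading of the estimate described above.

```python
from itertools import combinations, product


def is_acyclic(n, edges):
    """Kahn's algorithm: True if the directed graph on nodes 0..n-1 has no cycle."""
    indegree = [0] * n
    children = [[] for _ in range(n)]
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return removed == n


def enumerate_dags(n):
    """Yield the edge list of every labeled DAG over n nodes (feasible for small n only)."""
    pairs = list(combinations(range(n), 2))
    # 0 = no edge, 1 = u -> v, 2 = v -> u for each unordered pair (u, v).
    for choice in product((0, 1, 2), repeat=len(pairs)):
        edges = [(u, v) if c == 1 else (v, u)
                 for (u, v), c in zip(pairs, choice) if c != 0]
        if is_acyclic(n, edges):
            yield edges


def laplace_count(s_c, s, c, n_total, l=1.0):
    """Laplace-corrected estimate of N_C from s_c hits among s sampled DAGs."""
    return (s_c + l) / (s + c * l) * n_total


n_dags_3 = sum(1 for _ in enumerate_dags(3))
```

Counting configurations then amounts to classifying each enumerated DAG by its path relations and tallying the results; with the Laplace correction the estimates still sum to N over all c configurations.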

5 Computing the Joint Distribution J given Prior Path Beliefs K 

Eq. 2 shows how to compute the prior probability of a graph given the joint distribution J of
path beliefs r. In this section, we show how to compute J given the marginal beliefs on paths
involving pairs of variables stored in K. We denote with $\pi_{k,j}$ the probability that $r_k$ takes
value j:

$$\pi_{k,j} = P(r_k = j)$$

Figure 1: We assume the prior beliefs K in Table 1a (top) and the corresponding J in Table 1b. (a)
The configuration $C_1 = \{X \Rightarrow Y, Y \Rightarrow Z, X \Rightarrow Z\}$ holds in the graph. For
$p_1 = 0.5068$ (see Table 1b) we obtain the score
$Sc(G \mid K) = \log(0.5068) - \log(2800) = -8.6171$. (b) The configuration
$C_{49} = \{X \nLeftrightarrow Y, Y \Rightarrow Z, X \Rightarrow Z\}$ holds in the graph. For
$p_{49} = 0.0244$ we obtain the score $Sc(G \mid K) = \log(0.0244) - \log(1045) = -10.6662$. As
expected, the first graph has a higher prior than the second one since $X \Rightarrow Y$ is given a
higher probability than $X \nLeftrightarrow Y$ in Table 1a (top).


The values $\pi_{k,j}$ are provided in K. The unknown quantities are $p_C$ for each configuration C
in J. Let $\mathcal{C}_{k,j} = \{C \text{ s.t. } r_k = j\}$, i.e., the set of configurations where
variable $r_k$ obtains value j. For each k and j we obtain the following constraints:

$$\pi_{k,j} = \sum_{C \in \mathcal{C}_{k,j}} p_C \qquad (5)$$

In other words, the marginals of the joint should equal our input path beliefs. An important
observation that is characteristic of this problem is that path beliefs are not independent in general.
For example, if one believes with certainty that $X \Rightarrow Y \Rightarrow Z$, then they have to
believe $X \Rightarrow Z$ to be coherent. Thus, it is important to consider the following constraints,
stemming from the path semantics of the variables r:

$$p_C = 0, \text{ when } C \text{ is invalid} \qquad (6)$$

By invalid we mean a configuration that cannot be satisfied by the graph of any Bayesian Network
over V, e.g., it contains directed cycles. The algorithm to detect invalid configurations is discussed
later. To complete the problem specification we impose that:

$$\sum_{C} p_C = 1 \quad \text{and} \quad p_C \geq 0 \qquad (7)$$

If the constraints in Eqs. 5, 6, and 7 can be satisfied, then a joint distribution adhering to the
probability axioms can be found such that the prior marginal path beliefs hold. In this case, by
definition K is coherent; otherwise it is incoherent. Notice that all constraints together form a set
of linear equations for which it is easy to find a solution or to determine that no (non-negative)
solution exists. However, the number of unknowns $p_C$ equals $4^n$, where n is the number of input
path beliefs, and so the computational overhead increases exponentially with n.

Dealing with Coherent Beliefs. The system of equations contains 4n constraints from Eq. 5, m
constraints from Eq. 6, 1 equality constraint from Eq. 7, and $4^n$ unknowns. For most typical
problems, $4n + m + 1 \ll 4^n$ and so the system may have infinitely many solutions. We argue that one
should choose a solution joint probability distribution (jpd) J as close to the uninformative one as
possible. Any other distribution may introduce bias towards certain configurations, even if the prior
knowledge does not suggest preference over those configurations. In other words, if the uninformative
jpd is a coherent extension of the prior knowledge, there is no reason to prefer any other solution
over it. The problem can be formulated as follows:

$$\min_{p} \sum_{k=1}^{4^n} \left( p_k - \frac{N_k}{N} \right)^2 \quad \text{subject to the constraints in Eqs. 5, 6, 7} \qquad (8)$$

The quantity $N_k / N$, where $N_k$ is the number of graphs consistent with configuration $C_k$ and N
the total number of DAGs over V, corresponds to the uninformative priors where each graph is
equiprobable. The optimization problem of Eq. 8 is a quadratic program (quadratic objective function
with linear constraints) and can be solved accurately and relatively efficiently (with respect to the
number of unknowns).

Dealing with Incoherent Beliefs. In this case, there is no jpd whose marginals equal the input
beliefs. Instead of requesting coherent beliefs or ignoring the incoherency, we seek joints with
marginals as close as possible to the user's input beliefs. The constraints in Eq. 5 are now modified
to include slack variables, i.e., the amount by which the original constraints are violated:

$$\pi_{k,j} + s_{k,j} = \sum_{C \in \mathcal{C}_{k,j}} p_C \qquad (9)$$

Figure 2: Proof-of-concept experimental results, plotted against sample size. (a) True DAG
X → Y → Z: learning the orientations and the skeleton is facilitated by causal prior knowledge.
(b) True DAG X → Y ← Z: learning the graph is facilitated by correct associative prior knowledge
and hindered by incorrect priors. (c) SHD on the ALARM network: learning the ALARM network with
5 pieces of informative associative beliefs and without.


This system of equations is always solvable; out of all solutions, preference should be given to
solutions that violate the original constraints the least, leading to the following optimization problem:

$$\min_{p,s} \sum_{i=1}^{4n} s_i^2 + a \cdot \sum_{k=1}^{4^n} \left( p_k - \frac{N_k}{N} \right)^2 \quad \text{subject to the constraints in Eqs. 9, 6, 7} \qquad (10)$$

This problem simultaneously minimizes a trade-off between (a) the difference between the
marginal probabilities and the user beliefs, and (b) the difference between the solution jpd and the
uninformative jpd. The trade-off is controlled by the parameter a. For a = 0 one finds a valid jpd
whose marginals are as close as possible to the input beliefs. For $a = 4n / 4^n$ (the ratio of the
numbers of terms in the two sums) each sum is assigned equal importance (this is the value we employ
in our experiments). Table 1b contains the joint J stemming from K of Table 1a (top), computed by
solving Eq. 10. For comparison with the input beliefs K, Table 1a (bottom) contains the marginal
beliefs K' implied by J: $\pi'_{k,j} = \pi_{k,j} + s_{k,j}$. The values in Table 1a (top) and
Table 1a (bottom) are close, with the latter representing coherent beliefs.
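To illustrate the optimization in Eq. 10 without relying on a quadratic programming library, the sketch below eliminates the slack variables through Eq. 9 and minimizes the resulting objective by projected gradient descent onto the probability simplex. This is a substitute technique chosen for self-containment, not the solver used in our experiments. All names are hypothetical, the set of invalid configurations is supplied by the caller, and the uniform reference distribution in the example is a placeholder for the true ratios $N_C / N$.

```python
from itertools import product

VALUES = ("=>", "<=", "<=>", "none")  # our encoding of the four path relations


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # the last index satisfying the test fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def fit_joint(beliefs, reference, invalid=frozenset(), a=None, iters=2000):
    """Approximately minimize the Eq. 10 objective over valid configurations.

    beliefs:   list of dicts, one per path variable r_k, mapping value -> pi_{k,j}
    reference: dict mapping each configuration tuple to its uninformative probability
    invalid:   configurations forced to probability zero (Eq. 6)
    """
    n = len(beliefs)
    all_configs = list(product(VALUES, repeat=n))
    configs = [c for c in all_configs if c not in invalid]
    if a is None:
        a = 4 * n / 4 ** n
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * (n * 4 ** (n - 1) + a))
    p = project_to_simplex([reference[c] for c in configs])
    for _ in range(iters):
        # Marginals implied by the current joint (right-hand side of Eq. 9).
        marg = [dict.fromkeys(VALUES, 0.0) for _ in range(n)]
        for c, p_c in zip(configs, p):
            for k, j in enumerate(c):
                marg[k][j] += p_c
        grad = [2.0 * sum(marg[k][j] - beliefs[k][j] for k, j in enumerate(c))
                + 2.0 * a * (p_c - reference[c])
                for c, p_c in zip(configs, p)]
        p = project_to_simplex([x - step * g for x, g in zip(p, grad)])
    joint = dict.fromkeys(all_configs, 0.0)
    joint.update(zip(configs, p))
    return joint


# Hypothetical beliefs on two path variables; uniform reference is a placeholder.
beliefs = [{"=>": 0.8, "<=": 0.1, "<=>": 0.05, "none": 0.05}] * 2
reference = {c: 1.0 / 16 for c in product(VALUES, repeat=2)}
joint = fit_joint(beliefs, reference)
```

Because the objective is convex and the feasible set is a simplex, this simple scheme converges for small n; for larger problems a dedicated quadratic programming solver is the more appropriate tool.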

Determining Invalid Configurations. To identify all constraints in Eq. 6 we have implemented the
following algorithm. For each configuration C, we construct a graph G' whose nodes are the variables
that appear in at least one prior path belief. For each assignment $r_{XY} = \Rightarrow$ or
$r_{XY} = \Leftarrow$ in C, we add the edge X → Y or X ← Y respectively to G'. In addition, for each
assignment $r_{XY} = \Leftrightarrow$ we add a new dummy node $V_d$ to G' and add the edges
$X \leftarrow V_d \to Y$. Configuration C is invalid in the following cases: (a) G' contains cycles,
(b) for some X, Y in G', X has a directed path to Y and $r_{XY} = \Leftrightarrow$ or
$r_{XY} = \nLeftrightarrow$ in C, and (c) for some X, Y in G', X has a path to
Y in G' (not necessarily directed) and $r_{XY} = \nLeftrightarrow$ in C.

This algorithm is obviously sound, but it is not complete. A problem may arise when the number
of dummy nodes added to G' exceeds the number of available nodes (variables) in the data. In that
case, it may seem that a configuration is valid, but there may not be enough variables to satisfy all
confounding ($\Leftrightarrow$) relations in the context of the remaining path constraints. The
simplest example is a dataset with two variables X and Y: the configuration
$r_{XY} = \Leftrightarrow$ is invalid as there is no other variable to serve as common ancestor. Yet,
the above cases will not identify it as such. A less trivial example is $r_{XY} = \Leftrightarrow$ and
$r_{YZ} = \Rightarrow$ when the only variables are X, Y, Z. Since $r_{XY} = \Leftrightarrow$, it has to
be that $X \Leftarrow Z \Rightarrow Y$, which conflicts with $r_{YZ} = \Rightarrow$. Our
intuition is that a complete algorithm requires solving a constraint satisfaction problem. However,
when the number of variables in the data is large relative to the number of path beliefs (specifically
if $|V_{data}| > |V_{G'}|$ holds), the algorithm becomes complete (proof omitted for space).
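The validity check can be sketched as follows. The encoding of a configuration as a dictionary from variable pairs to relation labels is our own assumption. We also interpret the path in case (c) as a d-connecting path given the empty set (a directed path in either direction or a common ancestor in G'), which is an assumption on our part rather than a detail stated above. Like the algorithm it illustrates, the sketch is not complete.

```python
from itertools import count


def is_invalid(config):
    """Return True if the configuration violates cases (a), (b) or (c).

    config maps pairs (X, Y) to "=>", "<=", "<=>" (common ancestor) or "none".
    """
    children = {}

    def add_edge(u, v):
        children.setdefault(u, set()).add(v)
        children.setdefault(v, set())

    dummy_ids = count()
    for (x, y), r in config.items():
        children.setdefault(x, set())
        children.setdefault(y, set())
        if r == "=>":
            add_edge(x, y)
        elif r == "<=":
            add_edge(y, x)
        elif r == "<=>":
            d = ("dummy", next(dummy_ids))  # fresh common ancestor V_d
            add_edge(d, x)
            add_edge(d, y)

    def descendants(u):
        seen, stack = set(), list(children[u])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(children[v])
        return seen

    desc = {u: descendants(u) for u in children}
    # (a) a node that reaches itself lies on a directed cycle
    if any(u in desc[u] for u in children):
        return True
    for (x, y), r in config.items():
        directed = y in desc[x] or x in desc[y]
        if r == "<=>" and directed:
            return True  # (b) an ancestor relation contradicts "common ancestor only"
        if r == "none":
            common = any(x in d and y in d for d in desc.values())
            if directed or common:
                return True  # (b)/(c) a d-connecting path contradicts "no path"
    return False


c1 = {("X", "Y"): "=>", ("Y", "Z"): "=>", ("X", "Z"): "=>"}
c2 = {("X", "Y"): "=>", ("Y", "Z"): "=>", ("X", "Z"): "<="}
c3 = {("X", "Y"): "=>", ("Y", "Z"): "=>", ("X", "Z"): "<=>"}
```

On the configurations of Table 1, this check accepts $C_1$ and rejects $C_2$ (a directed cycle) and $C_3$ (a directed path from X to Z contradicts the common-ancestor relation), in agreement with their zero counts.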

6 Experimental Results 

Employing Causal Knowledge. We consider the graph X → Y → Z. As prior knowledge
we set $P(X \Rightarrow Z) = 0.9$ and distribute the remaining 0.1 probability mass to the remaining
values of $r_{XZ}$ in proportion to the values that correspond to a uniform prior. We repeat the
following experiment 10000 times: (a) we randomly select the number of states for each variable to be
either 3 or 4, (b) we sample the conditional probability tables (CPTs) for each variable using the
gamma distribution (gamrnd Matlab function with shape parameter A set to 0.5 and scale parameter B
set to 1), (c) we sample a dataset of size 200 from the network given the previously sampled CPTs,
(d) we increase the number of samples provided to the scoring method from 10 to 200 with a step size
of 10, (e) we identify the highest scoring network out of all 25 possible DAGs using informative
priors and the BDeu score with Equivalent Sample Size (ESS) set to 1 (see Eq. 3), (f) we similarly
identify the highest scoring network with uniform priors.
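Step (b) can be sketched in Python as follows. We assume here that each row of a conditional probability table is obtained by normalizing independent gamma draws so that it sums to one; the normalization and the function name `sample_cpt` are our assumptions, as the text above only specifies the gamma parameters.

```python
import random


def sample_cpt(n_states, n_parent_configs, shape=0.5, scale=1.0, rng=random):
    """Sample a CPT with one row (a distribution over states) per parent configuration."""
    table = []
    for _ in range(n_parent_configs):
        draws = [rng.gammavariate(shape, scale) for _ in range(n_states)]
        total = sum(draws)
        table.append([d / total for d in draws])  # normalize the row to sum to 1
    return table


# A child with 3 states whose single parent has 4 states, with a fixed seed.
cpt = sample_cpt(3, 4, rng=random.Random(0))
```

Normalizing independent gamma draws with a common scale yields a Dirichlet-distributed row, and a shape below 1 tends to produce skewed rows, i.e., relatively strong dependencies.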

Results: Figure 2a plots the percentage of the time the PDAG X — Y — Z of the true network was
found exactly, with and without informative priors. First notice that when the true PDAG is found
exactly, the edges are also always oriented correctly, since the true network has a higher prior than
any other Markov-equivalent graph. Perhaps more surprisingly, the informative
priors also improve the learning of the skeleton. The belief $X \Rightarrow Z$ tends to add a path from
X to Z. The associations between X and Y and between Y and Z are always at least as strong as the
association between X and Z (see [17]). Thus, it is the correct path X — Y — Z that tends to be
induced, rather than any other network with a path $X \Rightarrow Z$.

Employing Associative Knowledge. We run a similar proof-of-concept experiment where the
true network is a single collider X → Y ← Z. We use the same settings as before for three
cases: correct associative priors $P(X \nLeftrightarrow Z) = 0.9$, uniform priors, and incorrect
associative priors P(X associated with Z) = 0.9.

Results: The results are shown in Figure 2b. As expected, correct prior beliefs clearly improve
the chances of identifying the true PDAG; the effect is exactly the opposite when misleading,
incorrect beliefs are provided to the algorithm. Of course, asymptotically the priors, whether correct,
incorrect, or uninformative, play no role.

Learning Larger Networks. We sample 1000 datasets from the distribution of the ALARM network
[18]. We learn the network using greedy search-and-score with the typical operators add,
delete, and reverse an edge, and the BDeu metric with ESS = 1. We vary the sample size given to
the algorithms within {50, 75, 100, 150, 200}. For each dataset, we randomly pick 5 pairs (X, Y) of
variables on which to provide informative associative priors: if X and Y are associated in the true
network, we set P(X associated with Y) = 0.9; otherwise, we set P(X associated with Y) = 0.1. We run
search-and-score starting from the empty graph with and without the informative priors and compute
the Structural Hamming Distance (SHD) [19] from the true network. The simple search operators neither
consider nor exploit the path beliefs to improve optimization. We thus also run the search-and-score
algorithm starting from the true network, to gauge the potential for improvement when a better search
method is employed that at some point visits the true network.
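For reference, a simplified Structural Hamming Distance between two partially directed graphs can be sketched as below. We assume the per-pair variant in which every pair of nodes whose edge status differs (missing, extra, or differently oriented edge) contributes one unit; the graph encoding and function names are hypothetical.

```python
from itertools import combinations


def edge_state(graph, u, v):
    """State of the pair (u, v) in a graph stored as a set of edges.

    Directed edges are tuples (a, b) meaning a -> b; undirected edges are
    frozensets {a, b}.
    """
    if frozenset((u, v)) in graph:
        return "undirected"
    if (u, v) in graph:
        return "forward"
    if (v, u) in graph:
        return "backward"
    return "absent"


def shd(nodes, g1, g2):
    """Count the node pairs whose edge state differs between g1 and g2."""
    return sum(edge_state(g1, u, v) != edge_state(g2, u, v)
               for u, v in combinations(nodes, 2))


nodes = ["X", "Y", "Z"]
true_graph = {("X", "Y"), ("Z", "Y")}              # collider X -> Y <- Z
learned = {frozenset(("X", "Y")), ("Y", "Z")}      # X - Y undirected, Y -> Z
distance = shd(nodes, true_graph, learned)
```

In this example the pair (X, Y) differs in orientation status and the pair (Y, Z) is reversed, so the distance is 2.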

Results: The results are shown in Figure 2c. In both cases, the SHD is smaller with the
informative priors than with uniform priors. The differences in SHD for each sample size
are always statistically significant (using a one-sample t-test), with p-values close to the
machine epsilon. For low sample sizes (50 and 75) the 95% confidence intervals of the SHD
differences are [10.0959, 11.7821] and [6.1051, 7.2349] when starting from the empty graph, and
[8.8170, 10.6630] and [6.2721, 7.5399] when starting from the true graph.

7 Discussion and Conclusions 

We present a method for computing informative priors given a set of causal and associative beliefs 
on pairs of variables. The priors can then be employed by any search-and-score learning algorithm. 
Such beliefs can be induced from prior experimental or observational studies respectively, among 
other sources. The method, for the first time, addresses the issues of incoherent priors and priors that 
are not independent. Providing correct priors about pairwise causal or associative relations improves 
learning both in terms of identifying the orientation of the edges (for causal priors), but also in terms 
of identifying the skeleton of the network. 

There are numerous issues still to address regarding both the method and the general problem. The
algorithm computes a joint of prior beliefs whose size is exponential in the input (number of beliefs).
More efficient algorithms that perform this operation implicitly are desirable. The search for
the optimal graph becomes more complicated in the context of informative priors; typical greedy
search with operators on the edges alone may not suffice. Complete and efficient algorithms for
determining invalid configurations, as well as closed-form solutions for computing the number of
graphs given path constraints, are desirable. Finally, incorporating the strength of the causal effects
or associations and other prior knowledge characteristics is an interesting future direction to pursue.